The author explains how using GPT-4 for a nightly data extraction pipeline caused constant failures due to its non-deterministic nature. Even with strict prompting and temperature settings, the model would occasionally change key names or formatting, breaking the automated workflow. To solve this, the team switched to running smaller local models like Qwen2.5 via Ollama. By using seeded inference on their own hardware, they achieved the consistency needed for a reliable pipeline, finding that while small models lack GPT-4's reasoning depth, they are much better at performing repetitive, structured tasks identically every time.
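The approach described above has two halves: pinning a seed (and zero temperature) at inference time, and validating each response before it enters the pipeline so any remaining drift is caught rather than ingested. A minimal sketch of the validation half, with a hypothetical `EXPECTED_KEYS` schema standing in for the pipeline's real one (the seeded Ollama call itself is elided):

```python
import json

# Keys the nightly pipeline expects in every extraction result
# (hypothetical schema, for illustration only).
EXPECTED_KEYS = {"invoice_id", "date", "total"}

def validate_extraction(raw: str) -> dict:
    """Parse a model response and reject it if the key names drift.

    Raises ValueError so the pipeline can retry or alert instead of
    silently ingesting a malformed record.
    """
    record = json.loads(raw)
    if set(record) != EXPECTED_KEYS:
        missing = EXPECTED_KEYS - set(record)
        extra = set(record) - EXPECTED_KEYS
        raise ValueError(f"schema drift: missing={missing}, extra={extra}")
    return record
```

On the inference side, Ollama accepts a fixed `seed` alongside `temperature` in the request options, which is what makes repeated runs reproducible on the same hardware; a validator like the one above is the safety net in front of the automated workflow.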
IBM has introduced Granite 4.0 3B Vision, a specialized vision-language model (VLM) engineered for high-fidelity enterprise document data extraction. Unlike monolithic multimodal models, this release uses a modular LoRA adapter architecture, adding approximately 0.5B parameters to the Granite 4.0 Micro base model. This design allows for efficient dual-mode deployment, activating vision capabilities only when multimodal processing is required. The model excels at converting complex visual elements, such as charts and tables, into structured machine-readable formats like JSON, HTML, and CSV. By utilizing a high-resolution tiling mechanism and a DeepStack architecture for improved spatial alignment, Granite 4.0 3B Vision achieves impressive accuracy in tasks like Key-Value Pair extraction and chart reasoning, ranking highly on industry benchmarks.
This article details seven pre-built n8n workflows designed to streamline common data science tasks, including data extraction, cleaning, model training, and deployment.
A guide to extracting structured information effectively and accurately from long unstructured text with LangExtract and LLMs. The article explores Google’s LangExtract framework paired with the open-weight LLM Gemma 3, demonstrating how to parse an insurance policy to surface details like exclusions.
This article details a step-by-step guide on building a knowledge graph from plain text using an LLM-powered pipeline. It covers concepts like Subject-Predicate-Object triples, text chunking, and LLM prompting to extract structured information.
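The core of such a pipeline is turning the LLM's free-text response into clean Subject-Predicate-Object triples. A minimal sketch, assuming a hypothetical prompt convention where the model is asked to emit one pipe-delimited triple per line (the article's own prompt format may differ):

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

def parse_triples(llm_output: str) -> list[Triple]:
    """Parse 'subject | predicate | object' lines into triples.

    Lines that do not contain exactly three non-empty fields are
    dropped, since LLM output often includes preamble or commentary.
    """
    triples = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(Triple(*parts))
    return triples
```

Running the parser over each chunk's response and deduplicating the resulting triples yields the edge list from which the knowledge graph is assembled.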
A popular and actively maintained open-source web crawling library for LLMs and data extraction, offering advanced features like structured data extraction, browser control, and markdown generation.
ReaderLM-v2 is a 1.5B parameter language model developed by Jina AI, designed for converting raw HTML into clean markdown and JSON with high accuracy and improved handling of longer contexts. It supports multilingual text in 29 languages and offers advanced features such as direct HTML-to-JSON extraction. The model improves upon its predecessor by addressing issues like repetition in long sequences and enhancing markdown syntax generation.
Parsera is a simple and fast Python library for scraping websites using Large Language Models (LLMs). It's designed to be lightweight and minimize token usage for speed and cost efficiency.
This article explores the use of large language models (LLMs) for document parsing, offering a more powerful and flexible alternative to traditional methods like regular expressions. It walks through the workflow for processing documents such as research papers with LLMs, highlighting the advantages of this approach over brittle pattern matching.
Triplex is an open-source model that efficiently converts unstructured data into structured knowledge graphs at a fraction of the cost of existing methods. It undercuts GPT-4o on cost while outperforming it on extraction quality, making knowledge graph construction more accessible.